The JavaSearch toolkit @(#)README 1.7 95/03/20 -*- Text -*-
David Brown, Sun Microsystems Inc., November 1994
JavaSearch is a collection of classes used to CREATE and SEARCH
inverted-index text databases. JavaSearch is used in a two-step
process:
(1) Use the "javaindex" program to build a JavaSearch database for a
collection of documents.
(2) Use the Database, Searcher, DocList and Doc classes in YOUR
application to search a JavaSearch database.
Here are the details on the two steps:
----- (1) Creating a JavaSearch database -----
Use the javaindex program. Usage is:
java javaindex -db database_name [-trimprefix path_to_trim]
[-fileprefix path_prefix] [-urlprefix url_prefix]
[-description "Description of this database"]
filename filename ...
Where:
database_name is used to construct the 5 filenames which will
be created by javaindex, and is the name you need to know when
you later want to *search* the database. See Database.java
for info on the 5 filenames.
database_name can either be relative to the current directory,
or an absolute path. For example, using the database name
/foo/bar/databases/JAVASPEC
will cause the files JAVASPEC.dbinfo, JAVASPEC.index,
JAVASPEC.qindex, JAVASPEC.docs, and JAVASPEC.docindex to all be
created in the directory "/foo/bar/databases".
path_to_trim is a string that should be trimmed off the BEGINNING of
every filename we index, before saving a Doc object for that
filename in the Database we're creating. This is important
because URLs are constructed by concatenating the DATABASE's URL
prefix with the DOCUMENT's filename!
So if you're indexing the files /foo/bar/baz/*.html, and these
files happen to be accessible by URLs like
"http://tachyon.eng/baz/*.html", you would TRIM /foo/bar/baz/
and use a urlprefix of http://tachyon.eng/baz/.
Here's a real-world example, for the Java spec:
java javaindex -db JAVA_SPEC \
-trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ \
-urlprefix http://tachyon.eng/spec/ \
/net/tachyon/export/disk1/Mosaic/docs/spec/*.html
path_prefix is the string that should be prepended to the "filename"
of each individual Doc in this database, to construct a valid
fully-qualified pathname. For example, you might index all the
files in the current directory like this:
java javaindex -db FOO -fileprefix /full/path/to/this/dir/ *
url_prefix is the string that should be prepended to the "filename"
of each individual Doc in this database, to construct a valid
URL pointing to that document. See above for an example.
url_prefix and path_prefix may be used together, although in
general a Database will either be full of HTML files (in which
case they are always going to be accessed as URLs) or full of
plain text files (in which case path_prefix should be used,
since the documents will be read as regular files).
description is a human-readable description of this database.
filename ... This is the list of documents to index. These
filenames must be either absolute pathnames, or relative to the
current directory, although remember the "path_to_trim" string
is stripped off all filenames before they're stored in the
Database.
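The trimming and prefixing described above can be sketched as follows.
This is only an illustration, not code from JavaSearch: the class
DocNames, its methods, and the file name intro.html are made up here,
and leaving a filename unchanged when it does not start with
path_to_trim is an assumption.

```java
// Hypothetical helper, not part of JavaSearch.
class DocNames {
    // Strip pathToTrim off the BEGINNING of filename, as -trimprefix is
    // described. Assumption: a filename without that prefix is kept as-is.
    static String trim(String filename, String pathToTrim) {
        if (pathToTrim != null && filename.startsWith(pathToTrim)) {
            return filename.substring(pathToTrim.length());
        }
        return filename;
    }

    // A URL or full pathname is the prefix concatenated with the stored
    // (already trimmed) filename.
    static String resolve(String prefix, String storedName) {
        return prefix + storedName;
    }

    public static void main(String[] args) {
        String stored = trim("/foo/bar/baz/intro.html", "/foo/bar/baz/");
        System.out.println(stored);
        System.out.println(resolve("http://tachyon.eng/baz/", stored));
        System.out.println(resolve("/foo/bar/baz/", stored));
    }
}
```

Because the pieces are simply concatenated, both the trimmed prefix and
the url_prefix or path_prefix should end with "/" as in the examples.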
Sorry if the path_to_trim/path_prefix/url_prefix stuff is confusing;
note that WAIS does it the same way, though. Look for my notes on
"WAIS's URL type" in /net/barchetta/opt/wais/README.
Javaindex prints out a bunch of useful statistics when it finishes
creating a database.
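To check what a finished build should have left on disk, it can help
to list the five filenames derived from database_name. A minimal
sketch, assuming only the five extensions named earlier in this
README; the class DatabaseFiles and its method are hypothetical and
not part of JavaSearch (see Database.java for the real definitions).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JavaSearch: lists the five files that
// javaindex is described as creating for a given database name.
class DatabaseFiles {
    // Extensions as listed in this README.
    static final String[] EXTENSIONS = {
        ".dbinfo", ".index", ".qindex", ".docs", ".docindex"
    };

    static List<String> filesFor(String databaseName) {
        List<String> files = new ArrayList<String>();
        for (String ext : EXTENSIONS) {
            files.add(databaseName + ext);
        }
        return files;
    }

    public static void main(String[] args) {
        for (String f : filesFor("/foo/bar/databases/JAVASPEC")) {
            System.out.println(f);
        }
    }
}
```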
----- (2) Searching a JavaSearch database from your program -----
First of all, look at the program "javasearch.java": this is a very
simple command-line interface to perform searches on a JavaSearch
database. It demonstrates how to open a database, do a search, and
look at the results.
In a nutshell, you do the following to perform a search:
Database db = Database.OpenDatabase(database_name);
Searcher searcher = new Searcher(db);
DocList resultList = searcher.doSearch(query_string);
Now, use the DocList.getDocAt() method to look at the individual Doc
objects in the result list. For any Doc, you can get the headline
(doc.headline), or a URL for the Doc (db.docURLPrefix+doc.filename),
or a full pathname for the Doc (db.docPathPrefix + doc.filename).
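Walking a result list might look like the sketch below. The classes
here are small stand-ins written so the example compiles on its own;
they are NOT the real JavaSearch classes. Only the names headline,
filename, docURLPrefix, docPathPrefix and getDocAt come from this
README. The constructors, add(), size(), and the int argument to
getDocAt() are assumptions, so check Database.java and DocList.java
for the real signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the JavaSearch Doc class (assumed shape).
class Doc {
    String headline;
    String filename;
    Doc(String headline, String filename) {
        this.headline = headline;
        this.filename = filename;
    }
}

// Stand-in for the JavaSearch Database class (assumed shape).
class Database {
    String docURLPrefix;
    String docPathPrefix;
    Database(String docURLPrefix, String docPathPrefix) {
        this.docURLPrefix = docURLPrefix;
        this.docPathPrefix = docPathPrefix;
    }
}

// Stand-in for the JavaSearch DocList class (assumed shape).
class DocList {
    private final List<Doc> docs = new ArrayList<Doc>();
    void add(Doc d) { docs.add(d); }              // hypothetical
    int size() { return docs.size(); }            // hypothetical
    Doc getDocAt(int i) { return docs.get(i); }   // int index is an assumption
}

class ResultPrinter {
    // One line per result: number, headline, and the URL built by
    // concatenating the database's URL prefix with the Doc's filename.
    static List<String> describe(Database db, DocList results) {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < results.size(); i++) {
            Doc doc = results.getDocAt(i);
            lines.add(i + ": " + doc.headline + " -> "
                      + db.docURLPrefix + doc.filename);
        }
        return lines;
    }
}
```

For a database of plain text files, db.docPathPrefix + doc.filename
would be used in place of the URL.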
A query string is a simple boolean expression, like (for example)
"method and matching". Boolean operators "and", "or" and "not" are
allowed. Precedence is a trivial left-to-right evaluation;
parentheses are not supported. See the big doc comment for the
doSearch() method in Searcher.java for all the details.
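To make "left-to-right evaluation" concrete, here is a toy evaluator
over sets of document numbers. It is a sketch under stated
assumptions, not the code in Searcher.java: the class
LeftToRightQuery is hypothetical, the index is a plain in-memory Map,
words are lowercased, and "not" is treated as "and not" (set
difference), which is a guess based on the example query "touch and
interface not speech" later in this README.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy evaluator, not part of JavaSearch.
class LeftToRightQuery {
    // Evaluates "word op word op word ..." strictly left to right.
    static Set<Integer> evaluate(String query, Map<String, Set<Integer>> index) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");
        // Start with the documents containing the first word.
        Set<Integer> result = new TreeSet<Integer>(lookup(index, tokens[0]));
        for (int i = 1; i < tokens.length; i += 2) {
            if (i + 1 >= tokens.length) {
                throw new IllegalArgumentException(
                    "operator without a word: " + tokens[i]);
            }
            String op = tokens[i];
            Set<Integer> operand = lookup(index, tokens[i + 1]);
            if (op.equals("and")) {
                result.retainAll(operand);   // intersection
            } else if (op.equals("or")) {
                result.addAll(operand);      // union
            } else if (op.equals("not")) {
                result.removeAll(operand);   // difference (assumed meaning)
            } else {
                throw new IllegalArgumentException("unknown operator: " + op);
            }
        }
        return result;
    }

    // Words missing from the index match no documents.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String word) {
        Set<Integer> docs = index.get(word);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }
}
```

With a = {1}, b = {2} and c = {2, 3}, the query "a or b and c" is
evaluated as ((a or b) and c) and yields {2}; conventional precedence
would read it as (a or (b and c)) and yield {1, 2}.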
That's all there is. A typical "Searching app" or applet might give
the user text entry fields to select a database name and a query
string, then show the result headlines in a scrolling list, and then
have HotJava open the URL of any Doc which the user clicks on.
-----
See the detailed comment at the end of Database.java for a description
of the Files used by JavaSearch, and an overview of most of the
classes.
-----
EXAMPLES:
(1) Creating and searching a database of the Java language spec:
cd livejava/src/share/contrib/JavaSearch (eventually the JavaSearch
stuff should be in a package!)
java -cs javaindex -db /tmp/JAVA_SPEC -trimprefix /net/tachyon/export/disk1/Mosaic/docs/spec/ -urlprefix http://tachyon.eng/spec/ /net/tachyon/export/disk1/Mosaic/docs/spec/*.html
[This builds the database, in /tmp.]
[Now search for "method and matching":]
java -cs javasearch /tmp/JAVA_SPEC method and matching
[Look at the results. Selecting a document to view isn't too
useful here, since these documents have no valid filenames!
They're designed to be accessed by their URLs.]
(2) Creating a database of random text files, for example RFCs:
java -cs javaindex -db /tmp/Patent-stuff -trimprefix /usr/green/doc/Patents-Original/ -fileprefix /usr/green/doc/Patents-Original/ /usr/green/doc/Patents-Original/*.txt
[This takes a few minutes. (The indexer really needs the 'btree
optimization' (see below)!)]
[Now search:]
java -cs javasearch /tmp/Patent-stuff geographic and navigation
java -cs javasearch /tmp/Patent-stuff touch and interface not speech
[Once you see the results, type a document number to
javasearch's prompt, and javasearch will display that file.]
-----
RESTRICTIONS:
Here's a list of things that many other text search/retrieval systems
can do that JavaSearch can't. Some of these might be important to add
at some point.
- When indexing, each input file is treated as a document. Some other
systems let you have multiple documents per file.
- The searcher is missing numerous features found in other info
retrieval systems, such as: relevance ranked (or "weighted")
results; stopwords; synonyms, word stemming and searches like "foo*"
for all words beginning with "foo"; literal searches (like "method
overloading"); full parsing of a hierarchical boolean query (with
parentheses for precedence grouping); word proximity searching
("method near matching"); and many others.
Some of these features would require significant changes to
JavaSearch's architecture, some would require slight changes in the
index format, and others could be implemented by changing only the
Searcher class.
- The indexer currently knows only about plain text or HTML files.
The code does have an *outline* for recognizing News articles --
just grep for "NEWS" to find all the places you need to change to
add a new doc type.
- The indexer is wildly suboptimal in how it keeps the Word objects in
memory while building an index -- it *should* use a btree, but
instead keeps Words in an unsorted Vector! Sorry, I didn't get
around to writing a btree utility class. But this only makes
indexing slow; it has no effect on Searching performance. And
still, the indexer only takes a couple of minutes for small
databases like the Java documentation...
- Javaindex should have a "recursively index directories" feature.
This would work by having a command-line arg ("-R"?) that told
javaindex that each "filename" argument was really a directory,
and that it should index ALL files in that directory's hierarchy.
This is the only reasonable way to index very large databases, like
for example a news spool filesystem (like the fp.* groups).
- There's a whole bunch of other features which would be nice to have,
but I haven't had time to implement. Look for "REMIND" comments in
the JavaSearch code to find lots of notes like this.
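As an illustration of the sorted-structure idea from the indexer item
above, the sketch below accumulates words in a java.util.TreeMap
instead of an unsorted Vector. Note that TreeMap is a red-black tree,
not a btree; it is used here only as a readily available sorted map
with logarithmic lookup. The class SortedWordTable, its tokenizing
rule, and the use of plain document numbers are assumptions for this
example and do not reflect the real Word class or index format.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not part of JavaSearch.
class SortedWordTable {
    // word -> numbers of the documents containing it, sorted by word.
    private final TreeMap<String, TreeSet<Integer>> words =
        new TreeMap<String, TreeSet<Integer>>();

    // Record every word of one document. Tokenizing on runs of
    // non-alphanumeric characters is an assumption.
    void addDocument(int docNumber, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;   // split() yields "" when text starts with a delimiter
            }
            TreeSet<Integer> docs = words.get(token);   // O(log n) lookup
            if (docs == null) {
                docs = new TreeSet<Integer>();
                words.put(token, docs);
            }
            docs.add(docNumber);
        }
    }

    SortedMap<String, TreeSet<Integer>> table() {
        return words;
    }
}
```

With an unsorted Vector, finding an existing word is a linear scan for
every token read; a sorted map keeps that lookup logarithmic and also
hands the words back in order when the index is written out.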